334 research outputs found

An analysis of extensible modelling for functional genomics data

Author: Jones Andrew R
Paton Norman W
Publication venue: BioMed Central
Publication date: 01/09/2005

BACKGROUND: Several data formats have been developed for large scale biological experiments, using a variety of methodologies. Most data formats contain a mechanism for allowing extensions to encode unanticipated data types. Extensions to data formats are important because the experimental methodologies tend to be fairly diverse and rapidly evolving, which hinders the creation of formats that will be stable over time. RESULTS: In this paper we review the data formats that exist in functional genomics, some of which have become de facto or de jure standards, with a particular focus on how each domain has been modelled, and how each format allows extensions. We describe the tasks that are frequently performed over data formats and analyse how well each task is supported by a particular modelling structure. CONCLUSION: From our analysis, we make recommendations as to the types of modelling structure that are most suitable for particular types of experimental annotation. There are several standards currently under development that we believe could benefit from systematically following a set of guidelines.

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

The University of Manchester - Institutional Repository

Source Selection Languages: A Usability Evaluation

Author: Abel Edward
Galpin Ixent
Paton Norman W.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2018

Crossref

The University of Manchester - Institutional Repository

Deep Clustering for Data Cleaning and Integration

Author: Freitas Andre
Paton Norman W.
Rauf Hafiz Tayyab
Publication date: 22/09/2023

Deep Learning (DL) techniques now constitute the state-of-the-art for important problems in areas such as text and image processing, and there have been impactful results that deploy DL in several data management tasks. Deep Clustering (DC) has recently emerged as a sub-discipline of DL, in which data representations are learned in tandem with clustering, with a view to automatically identifying the features of the data that lead to improved clustering results. While DC has been used to good effect in several domains, particularly in image processing, the impact of DC on mainstream data management tasks remains unexplored. In this paper, we address this gap by investigating the impact of DC in data cleaning and integration tasks, specifically schema inference, entity resolution, and domain discovery, tasks that represent clustering from the perspective of tables, rows, and columns, respectively. In this setting, we compare and contrast several DC and non-DC clustering algorithms using standard benchmarks. The results show, among other things, that the most effective DC algorithms consistently outperform non-DC clustering algorithms for data integration tasks. However, we observed a significant correlation between the DC method and embedding approaches for rows, columns, and tables, highlighting that the suitable combination can enhance the efficiency of DC methods. Comment: The following enhancements have been carried out in the updated version of the manuscript: * Evaluated each data integration problem on additional datasets. * Added more DC and SC methods to the evaluation. * Discussed algorithmic-specific observation

arXiv.org e-Print Archive

A Critical and Integrated View of the Yeast Interactome

Author: Cornell Michael
Oliver Stephen G.
Paton Norman W.
Publication venue: Hindawi Publishing Corporation
Publication date: 01/01/2004

Global studies of protein–protein interactions are crucial to both elucidating gene function and producing an integrated view of the workings of living cells. High-throughput studies of the yeast interactome have been performed using both genetic and biochemical screens. Despite their size, the overlap between these experimental datasets is very limited. This could be due to each approach sampling only a small fraction of the total interactome. Alternatively, a large proportion of the data from these screens may represent false-positive interactions. We have used the Genome Information Management System (GIMS) to integrate interactome datasets with transcriptome and protein annotation data and have found significant evidence that the proportion of false-positive results is high. Not all high-throughput datasets are similarly contaminated, and the tandem affinity purification (TAP) approach appears to yield a high proportion of reliable interactions for which corroborating evidence is available. From our integrative analyses, we have generated a set of verified interactome data for yeast.

CiteSeerX

Crossref

Directory of Open Access Journals

PubMed Central

The University of Manchester - Institutional Repository

A hierarchical decentralized architecture to enable adaptive scalable virtual machine migration

Author: Hummaida Abdul R.
Paton Norman W.
Sakellariou Rizos
Publication venue: 'Wiley'
Publication date: 25/01/2023

The University of Manchester - Institutional Repository

Deep Clustering for Data Cleaning and Integration

Author: Freitas Andre
Paton Norman W.
Rauf Hafiz Tayyab
Publication venue: OpenProceedings
Publication date: 21/12/2024

Deep Learning (DL) techniques now constitute the state-of-the-art for important problems in areas such as text and image processing, and there have been impactful results that deploy DL in several data management tasks. Deep Clustering (DC) has recently emerged as a sub-discipline of DL, in which data representations are learned in tandem with clustering, with a view to automatically identifying the features of the data that lead to improved clustering results. While DC has been used to good effect in several domains, particularly in image processing, the potential of DC for data management tasks remains unexplored. In this paper, we address this gap by investigating the suitability of DC for data cleaning and integration tasks, specifically schema inference, entity resolution and domain discovery, from the perspective of tables, rows and columns, respectively. In this setting, we compare and contrast several DC and non-DC clustering algorithms using standard benchmarks. The results show, among other things, that the most effective DC algorithms consistently outperform non-DC clustering algorithms for data integration tasks. Experiments also show consistently strong performance compared with state-of-the-art bespoke algorithms for each of the data integration tasks.

The University of Manchester - Institutional Repository

Voyager: Data Discovery and Integration for Data Science

Author: Bogatu Alex
Douthwaite Mark
Freitas Andre
Paton Norman W.
Publication date: 23/03/2022

The University of Manchester - Institutional Repository

Fairness in Data Wrangling

Author: Fernandes Alvaro A.A.
Konstantinou Nikolaos
Mazilu Lacramioara
Paton Norman W.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 10/09/2020

Crossref

The University of Manchester - Institutional Repository

SBRML: A markup language for associating systems biology data with models

Author: Dada Joseph O.
Mendes Pedro
Paton Norman W.
Spasić Irena
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/02/2010

MOTIVATION: Research in systems biology is carried out through a combination of experiments and models. Several data standards have been adopted for representing models (Systems Biology Markup Language) and various types of relevant experimental data (such as FuGE and those of the Proteomics Standards Initiative). However, until now, there has been no standard way to associate a model and its entities to the corresponding datasets, or vice versa. Such a standard would provide a means to represent computational simulation results as well as to frame experimental data in the context of a particular model. Target applications include model-driven data analysis, parameter estimation, and sharing and archiving model simulations. RESULTS: We propose the Systems Biology Results Markup Language (SBRML), an XML-based language that associates a model with several datasets. Each dataset is represented as a series of values associated with model variables, and their corresponding parameter values. SBRML provides a flexible way of indexing the results to model parameter values, which supports both spreadsheet-like data and multidimensional data cubes. We present and discuss several examples of SBRML usage in applications such as enzyme kinetics, microarray gene expression and various types of simulation results.

Online Research @ Cardiff

The University of Manchester - Institutional Repository

Placement of Workloads from Advanced RDBMS Architectures into Complex Cloud Infrastructure

Author: Bostock Clive
Embury Suzanne M.
Higginson Antony S.
Paton Norman W.
Publication date: 23/03/2022

The University of Manchester - Institutional Repository